Misspecification in Inverse Reinforcement Learning
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function
$R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to
$R$. In the current literature, the most common models are optimality,
Boltzmann rationality, and causal entropy maximisation. One of the primary
motivations behind IRL is to infer human preferences from human behaviour.
However, the true relationship between human preferences and human behaviour is
much more complex than any of the models currently used in IRL. This means that
they are misspecified, which raises the worry that they might lead to unsound
inferences if applied to real-world data. In this paper, we provide a
mathematical analysis of how robust different IRL models are to
misspecification, and answer precisely how the demonstrator policy may differ
from each of the standard models before that model leads to faulty inferences
about the reward function $R$. We also introduce a framework for reasoning
about misspecification in IRL, together with formal tools that can be used to
easily derive the misspecification robustness of new IRL models.
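To make these behavioural models concrete, the Boltzmann-rational model mentioned above is standardly written (in illustrative notation, not taken from the paper) as

  $\pi(a \mid s) \propto \exp\big(\beta \, Q^*_R(s, a)\big),$

where $Q^*_R$ is the optimal Q-function for the reward function $R$ and $\beta > 0$ is an inverse-temperature (rationality) parameter; the optimality model corresponds to the limit $\beta \to \infty$. The misspecification question is then how far a real demonstrator's policy may deviate from such a model before inferences about $R$ become unreliable.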
Lexicographic Multi-Objective Reinforcement Learning
In this work we introduce reinforcement learning techniques for solving
lexicographic multi-objective problems. These are problems that involve
multiple reward signals, and where the goal is to learn a policy that maximises
the first reward signal, and subject to this constraint also maximises the
second reward signal, and so on. We present a family of both action-value and
policy gradient algorithms that can be used to solve such problems, and prove
that they converge to policies that are lexicographically optimal. We evaluate
the scalability and performance of these algorithms empirically, demonstrating
their practical applicability. As a more specific application, we show how our
algorithms can be used to impose safety constraints on the behaviour of an
agent, and compare their performance in this context with that of other
constrained reinforcement learning algorithms.
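A minimal formal sketch of the lexicographic objective, in illustrative notation rather than the paper's own: given reward signals $R_1, \dots, R_k$ with expected returns $J_1, \dots, J_k$, a policy is lexicographically optimal if it belongs to $\Pi_k$, where

  $\Pi_0 = \Pi$ (the set of all policies), and $\Pi_i = \arg\max_{\pi \in \Pi_{i-1}} J_i(\pi)$ for $i = 1, \dots, k$.

That is, each reward signal is maximised only over the policies that already maximise all higher-priority signals.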
Is SGD a Bayesian sampler? Well, almost
Overparameterised deep neural networks (DNNs) are highly expressive and so
can, in principle, generate almost any function that fits a training dataset
with zero error. The vast majority of these functions will perform poorly on
unseen data, and yet in practice DNNs often generalise remarkably well. This
success suggests that a trained DNN must have a strong inductive bias towards
functions with low generalisation error. Here we empirically investigate this
inductive bias by calculating, for a range of architectures and datasets, the
probability $P_{SGD}(f \mid S)$ that an overparameterised DNN, trained with
stochastic gradient descent (SGD) or one of its variants, converges on a
function $f$ consistent with a training set $S$. We also use Gaussian processes
to estimate the Bayesian posterior probability $P_B(f \mid S)$ that the DNN
expresses $f$ upon random sampling of its parameters, conditioned on $S$.
Our main findings are that $P_{SGD}(f \mid S)$ correlates remarkably well with
$P_B(f \mid S)$, and that $P_B(f \mid S)$ is strongly biased towards low-error and
low-complexity functions. These results imply that strong inductive bias in the
parameter-function map (which determines $P_B(f \mid S)$), rather than a special
property of SGD, is the primary explanation for why DNNs generalise so well in
the overparameterised regime.
While our results suggest that the Bayesian posterior $P_B(f \mid S)$ is the
first-order determinant of $P_{SGD}(f \mid S)$, there remain second-order
differences that are sensitive to hyperparameter tuning. A function-probability
picture, based on $P_{SGD}(f \mid S)$ and/or $P_B(f \mid S)$, can shed new light
on the way that variations in architecture or hyperparameter settings such as
batch size, learning rate, and optimiser choice affect DNN performance.
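As a rough sketch of how these two quantities can be defined (illustrative notation and assumptions, not quoted from the paper): $P_{SGD}(f \mid S)$ is the empirical frequency with which SGD training on $S$, started from random initialisations, converges to the function $f$ on a fixed set of inputs, while the zero-error Bayesian posterior is

  $P_B(f \mid S) = \dfrac{P(f)\, \mathbf{1}[f \text{ fits } S]}{\sum_{f'} P(f')\, \mathbf{1}[f' \text{ fits } S]},$

where $P(f)$ is the prior probability that the network expresses $f$ upon random sampling of its parameters, here estimated with Gaussian-process approximations.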
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
To solve a task with reinforcement learning (RL), it is necessary to formally
specify the goal of that task. Although most RL algorithms require that the
goal is formalised as a Markovian reward function, alternatives have been
developed (such as Linear Temporal Logic and Multi-Objective Reinforcement
Learning). Moreover, it is well known that some of these formalisms are able to
express certain tasks that other formalisms cannot express. However, there has
not yet been any thorough analysis of how these formalisms relate to each other
in terms of expressivity. In this work, we fill this gap in the existing
literature by providing a comprehensive comparison of the expressivities of 17
objective-specification formalisms in RL. We place these formalisms in a
preorder based on their expressive power, and present this preorder as a Hasse
diagram. We find a variety of limitations for the different formalisms, and
that no formalism is both dominantly expressive and straightforward to optimise
with current techniques. For example, we prove that each of Regularised RL,
Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and
Limit Average Rewards can express an objective that the others cannot. Our
findings have implications for both policy optimisation and reward learning.
Firstly, we identify expressivity limitations which are important to consider
when specifying objectives in practice. Secondly, our results highlight the
need for future research which adapts reward learning to work with a variety of
formalisms, since many existing reward learning methods implicitly assume that
desired objectives can be expressed with Markovian rewards. Our work
contributes towards a more cohesive understanding of the costs and benefits of
different RL objective-specification formalisms.
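For orientation, one of the formalisms compared above, Limit Average Rewards, evaluates a policy by its long-run average reward rather than a discounted sum; in standard notation (not quoted from the paper),

  $J(\pi) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} R(s_t, a_t)\Big].$

Objectives of this form are one illustration of how the formalisms above can differ in what they are able to express.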
Goodhart's Law in Reinforcement Learning
Implementing a reward function that perfectly captures a complex task in the
real world is impractical. As a result, it is often appropriate to think of the
reward function as a proxy for the true objective rather than as its
definition. We study this phenomenon through the lens of Goodhart's law, which
predicts that increasing optimisation of an imperfect proxy beyond some
critical point decreases performance on the true objective. First, we propose a
way to quantify the magnitude of this effect and show empirically that
optimising an imperfect proxy reward often leads to the behaviour predicted by
Goodhart's law for a wide range of environments and reward functions. We then
provide a geometric explanation for why Goodhart's law occurs in Markov
decision processes. We use these theoretical insights to propose an optimal
early stopping method that provably avoids the aforementioned pitfall and
derive theoretical regret bounds for this method. Moreover, we derive a
training method that maximises worst-case reward, for the setting where there
is uncertainty about the true reward function. Finally, we evaluate our early
stopping method experimentally. Our results provide a foundation for a
theoretically principled study of reinforcement learning under reward
misspecification.
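To make the effect concrete, in notation of my own choosing rather than the paper's: write $J_R(\pi)$ for the expected return of a policy $\pi$ under the true reward $R$, and $J_{\tilde{R}}(\pi)$ for its return under the proxy $\tilde{R}$. Goodhart's law, as studied here, is the pattern that along a sequence of policies with increasing proxy return $J_{\tilde{R}}(\pi_1) < J_{\tilde{R}}(\pi_2) < \cdots$, the true return $J_R(\pi_i)$ typically rises at first but falls once the proxy is optimised beyond some critical point; the early stopping method is designed to halt optimisation before that point is reached.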
Neural networks are a priori biased towards Boolean functions with low entropy
Understanding the inductive bias of neural networks is critical to explaining
their ability to generalise. Here, for one of the simplest neural networks -- a
single-layer perceptron with n input neurons, one output neuron, and no
threshold bias term -- we prove that upon random initialisation of weights, the
a priori probability P(t) that it represents a Boolean function that classifies
t points in {0,1}^n as 1 has a remarkably simple form: P(t) = 2^{-n} for 0 \leq t < 2^n.
Since a perceptron can express far fewer Boolean functions with small or
large values of t (low entropy) than with intermediate values of t (high
entropy), there is, on average, a strong intrinsic a priori bias towards
individual functions with low entropy. Furthermore, within a class of functions
with fixed t, we often observe a further intrinsic bias towards functions of
lower complexity. Finally, we prove that, regardless of the distribution of
inputs, the bias towards low entropy becomes monotonically stronger upon adding
ReLU layers, and empirically show that increasing the variance of the bias term
has a similar effect.
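The flat form of P(t) is easy to probe numerically. The sketch below is my own illustration rather than code from the paper, and assumes the perceptron outputs 1 exactly when the weighted sum of its inputs is strictly positive; under the result above, the estimated histogram of P(t) should be approximately flat at height 2^{-n} over 0 \leq t < 2^n.

import numpy as np
from itertools import product

# Estimate P(t) for a bias-free single-layer perceptron with n inputs by
# sampling random weight vectors and counting how many of the 2^n Boolean
# points each sampled perceptron classifies as 1.
n = 5
inputs = np.array(list(product([0, 1], repeat=n)), dtype=float)  # all 2^n points

rng = np.random.default_rng(0)
num_samples = 100_000
counts = np.zeros(2**n + 1, dtype=int)

for _ in range(num_samples):
    w = rng.standard_normal(n)            # random weights, no threshold bias term
    t = int((inputs @ w > 0).sum())       # number of points classified as 1
    counts[t] += 1

p_t = counts / num_samples                # empirical estimate of P(t)
print(p_t[:2**n])                         # predicted to be roughly 2^{-n} = 1/32 each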